InΒ [743]:
from IPython.display import display, HTML
display(HTML("<style>.jp-Notebook {width: 70% !important; margin: auto !important;} table {margin: 0 auto !important; margin-top: 20px !important;}</style>"))

Table of Contents:ΒΆ

  • What does the dataset contain?
  • EDA
  • Training
    • KNN
    • Neural Network
    • Logistic Regression
    • Random Forest
  • Further Questions
    • Which features are the most important?
    • Messing with the dataset (aka which model is the best with a bad quality dataset?)
  • Conclusion



What does the dataset contain? ΒΆ

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.
No description has been provided for this image
InΒ [Β ]:
import pandas as pd
from IPython.display import display, Markdown

data = {
    "Feature Index": [
        "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"
    ],
    "Feature Name": [
        "ID number", 
        "Diagnosis (M = malignant, B = benign)", 
        "Radius (mean of distances from center to points on the perimeter)", 
        "Texture (standard deviation of gray-scale values)", 
        "Perimeter", 
        "Area", 
        "Smoothness (local variation in radius lengths)", 
        "Compactness (perimeter^2 / area - 1.0)", 
        "Concavity (severity of concave portions of the contour)", 
        "Concave points (number of concave portions of the contour)", 
        "Symmetry", 
        "Fractal dimension ('coastline approximation' - 1)"
    ]
}

df_features = pd.DataFrame(data)
df_features.columns = ["Feature Index", "Feature Name"]

styled_table = (
    df_features.style
    .set_caption("Breast Cancer Wisconsin Dataset Features")
    .set_table_styles([
        {'selector': 'th', 'props': [('text-align', 'left')]},
        {'selector': 'td', 'props': [('text-align', 'left')]},
        {'selector': 'caption', 'props': [('caption-side', 'top'), ('font-weight', 'bold')]}
    ])
)

display(styled_table)
Breast Cancer Wisconsin Dataset Features
Β  Feature Index Feature Name
0 1 ID number
1 2 Diagnosis (M = malignant, B = benign)
2 3 Radius (mean of distances from center to points on the perimeter)
3 4 Texture (standard deviation of gray-scale values)
4 5 Perimeter
5 6 Area
6 7 Smoothness (local variation in radius lengths)
7 8 Compactness (perimeter^2 / area - 1.0)
8 9 Concavity (severity of concave portions of the contour)
9 10 Concave points (number of concave portions of the contour)
10 11 Symmetry
11 12 Fractal dimension ('coastline approximation' - 1)



Exploratory Data Analysis ΒΆ

InΒ [715]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
InΒ [716]:
df = pd.read_csv('data.csv')
InΒ [717]:
df.describe()
Out[717]:
id radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean symmetry_mean ... texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst Unnamed: 32
count 5.690000e+02 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 ... 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 0.0
mean 3.037183e+07 14.127292 19.289649 91.969033 654.889104 0.096360 0.104341 0.088799 0.048919 0.181162 ... 25.677223 107.261213 880.583128 0.132369 0.254265 0.272188 0.114606 0.290076 0.083946 NaN
std 1.250206e+08 3.524049 4.301036 24.298981 351.914129 0.014064 0.052813 0.079720 0.038803 0.027414 ... 6.146258 33.602542 569.356993 0.022832 0.157336 0.208624 0.065732 0.061867 0.018061 NaN
min 8.670000e+03 6.981000 9.710000 43.790000 143.500000 0.052630 0.019380 0.000000 0.000000 0.106000 ... 12.020000 50.410000 185.200000 0.071170 0.027290 0.000000 0.000000 0.156500 0.055040 NaN
25% 8.692180e+05 11.700000 16.170000 75.170000 420.300000 0.086370 0.064920 0.029560 0.020310 0.161900 ... 21.080000 84.110000 515.300000 0.116600 0.147200 0.114500 0.064930 0.250400 0.071460 NaN
50% 9.060240e+05 13.370000 18.840000 86.240000 551.100000 0.095870 0.092630 0.061540 0.033500 0.179200 ... 25.410000 97.660000 686.500000 0.131300 0.211900 0.226700 0.099930 0.282200 0.080040 NaN
75% 8.813129e+06 15.780000 21.800000 104.100000 782.700000 0.105300 0.130400 0.130700 0.074000 0.195700 ... 29.720000 125.400000 1084.000000 0.146000 0.339100 0.382900 0.161400 0.317900 0.092080 NaN
max 9.113205e+08 28.110000 39.280000 188.500000 2501.000000 0.163400 0.345400 0.426800 0.201200 0.304000 ... 49.540000 251.200000 4254.000000 0.222600 1.058000 1.252000 0.291000 0.663800 0.207500 NaN

8 rows Γ— 32 columns

All the relevant features are floats and there is an unnecessary feature called "Unnamed: 32" that seems to appear due to this issue with read_csv(): https://www.kaggle.com/discussions/general/354943
InΒ [718]:
df.drop(['id', 'Unnamed: 32'], axis=1, inplace=True)
df.head()
Out[718]:
diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean symmetry_mean ... radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst
0 M 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419 ... 25.38 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890
1 M 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812 ... 24.99 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902
2 M 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069 ... 23.57 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758
3 M 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 0.2597 ... 14.91 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300
4 M 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809 ... 22.54 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678

5 rows Γ— 31 columns

We drop the "id" and "Unnamed: 32" columns since thery are irrelevant.
InΒ [719]:
def diagnosis_value(diagnosis):
    return 1 if diagnosis == 'M' else 0
Encode the target feature numerically.
InΒ [720]:
df['diagnosis'] = df['diagnosis'].apply(diagnosis_value)
df.head()
Out[720]:
diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean symmetry_mean ... radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst
0 1 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419 ... 25.38 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890
1 1 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812 ... 24.99 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902
2 1 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069 ... 23.57 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758
3 1 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 0.2597 ... 14.91 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300
4 1 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809 ... 22.54 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678

5 rows Γ— 31 columns

Applied the numerical encoding of the target variable to the dataset.
InΒ [721]:
df.isnull().values.any()
Out[721]:
np.False_
There are no null values.
InΒ [722]:
# Plot histograms for each feature
df.hist(bins=15, figsize=(20, 15), layout=(6, 6))
plt.tight_layout()
plt.show()
No description has been provided for this image
The dataset is slightly imbalanced. Most features seem to follow a normal distribution (perhaps with a positive skewness). For the rest it's feasible to apply a log-transform.
InΒ [723]:
skewness_arr = df.skew().sort_values(ascending=False)
skewness_arr = skewness_arr[skewness_arr > 2]
print(skewness_arr)

skewness_arr = skewness_arr.index.tolist()

# Add 1 to every skewed data since log(0) is undefined and log(epsilon) is large
df[skewness_arr] = df[skewness_arr].apply(lambda x: np.log(x + 1))
df[df.columns.difference(['diagnosis'])] = preprocessing.StandardScaler().fit_transform(df[df.columns.difference(['diagnosis'])])
area_se                 5.447186
concavity_se            5.110463
fractal_dimension_se    3.923969
perimeter_se            3.443615
radius_se               3.088612
smoothness_se           2.314450
symmetry_se             2.195133
dtype: float64
Select features with a large positive skewness to apply log-transform.
After that, standardize every feature (except diagnosis) with StandardScaler().
InΒ [724]:
df.hist(bins=15, figsize=(20, 15), layout=(6, 6))
plt.tight_layout()
plt.show()
No description has been provided for this image
InΒ [725]:
# Plot the correlation matrix
plt.figure(figsize=(12, 10))
correlation_matrix = df.corr()
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap='coolwarm', linewidths=0.5).figure.set_size_inches(20, 10)
plt.title('Correlation Matrix')
plt.show()
No description has been provided for this image
The eventual diagnosis correlates well with the "size" (perimeter, radius, area) and concavity of the nucleus. Generally, the dataset seems to be highly correlated, altough there are several irrelevant correlations between features that measure roughly the same thing (mean, worst, se for every feature).
InΒ [726]:
# Plot pairplot for a subset of features
sns.pairplot(
    df[["diagnosis", "radius_mean", "texture_mean", "perimeter_mean", "area_mean",
        "smoothness_mean", "compactness_mean", "concavity_mean", "concave points_mean", "symmetry_mean",
        "fractal_dimension_mean" ]],
    hue = "diagnosis",
    palette={1: 'orange', 0: 'blue'}
)
plt.show()
No description has been provided for this image
Overall, the data is well separated. The only features that are not well-separated are fractal_dimension, symmetry, smoothness and texture.
We can also observe that the larger the nucleus is (measured by the radius, perimeter and area) there is a higher probability of having malignant breast cancer.
Concavity shows similar properties but there are a few outliers.
It is clear from the EDA that the dataset is of good quality: no missing data, well-separated, the features are well-distributed and it is only slightly imbalanced.
The only thing that might be concerning is the large number of features which results in the dataset having a high dimensionality. However, we will address this issue later with dimension reduction.



Training ΒΆ

In general, we will use cross-validation and a Grid/Random search for hyperparameter optimalization.
Several models are featured including simple (e.g.: KNN) and more robust ones as well (e.g.: Neural Networks, Random Forest etc.).
The following models are tested: KNN, Neural Networks, Logistic Regression and Random Forest.
F1-score is used as a metric since both false positives and false negatives are important.
InΒ [727]:
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.metrics import classification_report
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import pprint
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

class ModelTrain:
    def __init__(self, df):
        X = df.copy()
        X = df.drop('diagnosis', axis=1)
        y = df['diagnosis']

        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
        self.X_train = X_train
        self.X_test = X_test
        self.y_train = y_train
        self.y_test = y_test
        self.X = X
        self.y = y
        self.cv = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 42)
    
    def __train(self, model):
        model.fit(self.X_train, self.y_train)
        best_model = model.best_estimator_
        return best_model, best_model.predict(self.X_test)
    
    def __grid_search(self, model, param_grid):
        return GridSearchCV(
            estimator = model,
            param_grid = param_grid,
            cv = self.cv,
            scoring = 'f1',
            n_jobs = -1
        )

    def __res(self, best_model, y_pred):
        return (best_model,
            classification_report(self.y_test, y_pred, output_dict=True, target_names=['B', 'M']),
            classification_report(self.y_test, y_pred, target_names=['B', 'M']))
  
    def KNN(self):
        knn = KNeighborsClassifier()

        param_grid = {
            'n_neighbors': [3, 5, 7, 9, 11],
            'weights': ['uniform', 'distance'],
            'metric': ['euclidean', 'manhattan', 'minkowski']
        }

        grid_search_model = self.__grid_search(knn, param_grid)
        best_model, y_pred = self.__train(grid_search_model)
        return self.__res(best_model, y_pred)
    
    def NeuralNetwork(self):
        mlp = MLPClassifier(random_state=1, max_iter=10000)

        param_grid = {
            'hidden_layer_sizes': [(8,), (16,), (32,), (16, 16)],
            'activation': ['tanh', 'relu'],
            'solver': ['adam', 'sgd'],
            'learning_rate_init': [0.001, 0.01, 0.1],
            'alpha': [0.0001, 0.001, 0.01],
        }

        grid_search_model = self.__grid_search(mlp, param_grid)
        best_model, y_pred = self.__train(grid_search_model)
        return self.__res(best_model, y_pred)
   
    def LogisticRegression(self):
        logreg = LogisticRegression()
        logreg.fit(self.X_train, self.y_train)
        y_pred = logreg.predict(self.X_test)
        return self.__res(logreg, y_pred)
   
    def RandomForest(self):
        rf = RandomForestClassifier(random_state=42)

        param_grid = {
            'n_estimators': [100, 200, 300],
            'max_depth': [None, 10, 20, 30],
            'min_samples_split': [2, 5, 10],
            'min_samples_leaf': [1, 2, 4],
            'bootstrap': [True, False],
        }

        grid_search_model = self.__grid_search(rf, param_grid)
        best_model, y_pred = self.__train(grid_search_model)
        return self.__res(best_model, y_pred)

mt = ModelTrain(df)

KNN ΒΆ

InΒ [728]:
best_model_knn, _, report_display = mt.KNN()
print(report_display)
              precision    recall  f1-score   support

           B       0.96      0.97      0.97        71
           M       0.95      0.93      0.94        43

    accuracy                           0.96       114
   macro avg       0.96      0.95      0.95       114
weighted avg       0.96      0.96      0.96       114

Neural Network ΒΆ

InΒ [729]:
best_model_nn, _, report_display = mt.NeuralNetwork()
print(report_display)
              precision    recall  f1-score   support

           B       0.97      0.99      0.98        71
           M       0.98      0.95      0.96        43

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114

Logistic Regression ΒΆ

InΒ [730]:
best_model_logres, _, report_display = mt.LogisticRegression()
print(report_display)
              precision    recall  f1-score   support

           B       0.99      0.99      0.99        71
           M       0.98      0.98      0.98        43

    accuracy                           0.98       114
   macro avg       0.98      0.98      0.98       114
weighted avg       0.98      0.98      0.98       114

Random Forest ΒΆ

InΒ [731]:
best_model_ranfor, _, report_display = mt.RandomForest()
print(report_display)
              precision    recall  f1-score   support

           B       0.96      0.97      0.97        71
           M       0.95      0.93      0.94        43

    accuracy                           0.96       114
   macro avg       0.96      0.95      0.95       114
weighted avg       0.96      0.96      0.96       114

After training and testing several models we can conclude that all of them perform pretty well on the dataset, achieving a precision > 95%.
However, Logistic Regression stands out from the tested models, achieving the highest score of 99%. Surprisingly, a simple model such as KNN beats a much more complex Neural Network classifier.



Further Questions ΒΆ

Now, we look at a few further questions worth investigating.

Which features are the most important? ΒΆ

InΒ [732]:
coefficients = pd.DataFrame(best_model_logres.coef_.flatten(), mt.X.columns, columns=['Coefficient'])
coefficients = coefficients.sort_values(by='Coefficient', ascending=False)
plt.figure(figsize=(10, 6))
sns.barplot(x='Coefficient', y=coefficients.index, data=coefficients, hue=coefficients.index, palette='viridis', legend=False)
plt.title('Feature Importances from Logistic Regression')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.show()
No description has been provided for this image
By extracting Logistic Regression's coefficient matrix we can see which features were the most important (aka had the largest coefficients) in the linear combination.
Among the most relevant features are texture, radius, area and perimeter.
InΒ [733]:
cols = list(df.columns)
cols.remove('diagnosis')
# for pair in sorted(zip(cols, best_model_ranfor.feature_importances_), key=lambda x: x[1], reverse=True):
#     # print(f"{pair[0]}: {round(pair[1], 3)}")
#     pass

feature_importances = best_model_ranfor.feature_importances_
feature_df = pd.DataFrame({
    'Feature': cols,
    'Importance': feature_importances
})

feature_df = feature_df.sort_values(by='Importance', ascending=False)

plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=feature_df, hue='Feature', palette='viridis', legend=False)
plt.title('Feature Importances from Random Forest')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.show()
No description has been provided for this image
We can use the Random Forest model's .feature_importances_ property to see which features correlate the most with having malignant breast cancer.
The top features are concavity, area, radius and perimeter and this aligns well with the biological interpretation since large radius and irregular texture often indicate malignancy because malignant tumors tend to grow and invade surrounding tissue.
We can observe that both Logistic Regression and Random Forest consider roughly the same features as important. However, the coefficients in Logistic Regression can be negative (since they are just real numbers in the linear combination) while the feature importance in Random Forest is always nonnegative (since it does not measure a linear relationship but a contribution of a feature to reducing impurity).

Messing with the dataset (aka which model is the best with a bad quality dataset?) ΒΆ

As we concluded earlier, the dataset has several properties that make it easy to train and work with. Thus, a natural question arises: what if the data is imprecise and bad quality?
So we deliberately made the initial dataset imperfect and tested a few models on the new data.
InΒ [734]:
noisy_df = df.copy()
noise_level = 0.05
numeric_columns = noisy_df.loc[:, noisy_df.columns != 'diagnosis']

for col in numeric_columns:
    col_range = noisy_df[col].max() - noisy_df[col].min()
    noise = np.random.uniform(-noise_level * col_range, noise_level * col_range, size=noisy_df.shape[0])
    noisy_df[col] += noise
First, we added 5% of uniformly generated random noise to each feature, proportional to its range.
InΒ [735]:
columns_to_keep = [col for col in noisy_df.columns if col.endswith('_mean')]
X_noisy = noisy_df[columns_to_keep]
X_noisy.head()
Out[735]:
radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean symmetry_mean fractal_dimension_mean
0 0.924103 -1.757231 1.011019 1.260656 1.593912 3.291032 2.719108 2.652186 2.458015 2.340878
1 2.000094 -0.335667 1.468144 1.975012 -0.445580 -0.404057 -0.287819 0.735925 0.313804 -1.108398
2 1.559854 0.478012 1.695131 1.704445 0.646808 1.118353 1.291260 2.141907 0.644783 -0.456402
3 -0.858307 0.296378 -0.606947 -0.584121 3.270515 3.616625 2.147613 1.220999 2.864877 4.855046
4 1.916440 -1.204862 1.785224 1.535342 0.337097 0.664072 1.503881 1.287655 -0.124872 -0.287465
Then all the features were removed except those of "mean" type. Our hypothesis was that the mean captures information about the feature well enough so "se" and "worst" are unnecessary.
InΒ [736]:
correlation_matrix_selected = df[columns_to_keep].corr()

plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix_selected, annot=True, fmt=".2f", cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Matrix for Mean Features')
plt.show()
No description has been provided for this image
Plotting the correlation matrix again, we see that the data is still highly correlated.
InΒ [737]:
df_with_missing = noisy_df.copy()
missing_percentage = 0.3  # 30% missing values
columns_to_modify = df_with_missing.loc[:, df_with_missing.columns != 'diagnosis']

for col in columns_to_modify:
    num_missing = int(len(df_with_missing[col]) * missing_percentage)
    missing_indices = np.random.choice(df_with_missing.index, size=num_missing, replace=False)
    df_with_missing.loc[missing_indices, col] = np.nan

df_with_missing.head()
Out[737]:
diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean symmetry_mean ... radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst
0 1 0.924103 NaN 1.011019 NaN 1.593912 3.291032 2.719108 NaN NaN ... 2.082082 NaN 2.007561 1.722426 1.635985 2.732424 1.953923 2.424034 2.748584 2.120955
1 1 2.000094 -0.335667 1.468144 NaN -0.445580 -0.404057 -0.287819 NaN 0.313804 ... 1.947317 -0.520197 1.622273 1.534513 NaN NaN -0.057519 NaN -0.606162 0.516005
2 1 1.559854 0.478012 NaN 1.704445 NaN 1.118353 1.291260 2.141907 0.644783 ... 1.775112 NaN 1.621481 1.660115 NaN NaN 0.902021 2.038613 0.883453 NaN
3 1 -0.858307 0.296378 -0.606947 -0.584121 3.270515 NaN 2.147613 NaN NaN ... -0.217726 -0.124007 -0.142061 -0.702994 3.375338 3.976404 1.822231 2.215575 5.789378 4.690161
4 1 1.916440 -1.204862 1.785224 1.535342 NaN 0.664072 NaN 1.287655 -0.124872 ... 1.433691 -1.239162 1.309944 0.997753 0.503924 NaN NaN 0.826369 -0.931863 -0.539437

5 rows Γ— 31 columns

Next, we replace 30% of the exsiting data with NaN values so that it resembles more to a real-life dataset.
InΒ [738]:
imp = IterativeImputer(max_iter=10, random_state=0)
df_with_missing[:] = imp.fit_transform(df_with_missing)
df_with_missing = pd.DataFrame(df_with_missing, columns=df.columns)
rows_to_drop = df.sample(frac=0.3, random_state=42).index
df_with_missing = df.drop(rows_to_drop)
df_with_missing.head()
Out[738]:
diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean symmetry_mean ... radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst
1 1 1.829821 -0.353632 1.685955 1.908708 -0.826962 -0.487072 -0.023846 0.548144 0.001392 ... 1.805927 -0.369203 1.535126 1.890489 -0.375612 -0.430444 -0.146749 1.087084 -0.243890 0.281190
3 1 -0.768909 0.253732 -0.592687 -0.764464 3.283553 3.402909 1.915897 1.451707 2.867383 ... -0.281464 0.133984 -0.249939 -0.550021 3.394275 3.893397 1.989588 2.175786 6.046041 4.935010
4 1 1.750297 -1.151816 1.776573 1.826229 0.280372 0.539340 1.371011 1.428493 -0.009560 ... 1.298575 -1.466770 1.338539 1.220724 0.220556 -0.313395 0.613179 0.729259 -0.868353 -0.397100
5 1 -0.476375 -0.835335 -0.387148 -0.505650 2.237421 1.244335 0.866302 0.824656 1.005402 ... -0.165498 -0.313836 -0.115009 -0.244320 2.048513 1.721616 1.263243 0.905888 1.754069 2.241802
7 1 -0.118517 0.358450 -0.072867 -0.218965 1.604049 1.140102 0.061026 0.281950 1.403355 ... 0.163763 0.401048 0.099449 0.028859 1.447961 0.724786 -0.021054 0.624196 0.477640 1.726435

5 rows Γ— 31 columns

Next, we used IterativeImputer to fill the missing values. Also randomly drop 30% of the data.
InΒ [739]:
def i_to_name(i):
    if i == 0:
        return "Neural Network"
    elif i == 1:
        return "Logistic Regression"
    return "Random Forest"

results = []

for _ in range(5):
    num_columns_to_select = np.random.randint(2, 5)
    random_columns = np.random.choice(df_with_missing.loc[:, df_with_missing.columns != 'diagnosis'].columns, size=num_columns_to_select, replace=False)
    random_columns = np.append(random_columns, 'diagnosis')
    df_sel = df[random_columns]
    mt_noisy = ModelTrain(df_sel)
    models = [mt_noisy.NeuralNetwork, mt_noisy.LogisticRegression, mt_noisy.RandomForest]

    print("Training with the following features:", df_sel.columns)
    
    for i in range(len(models)):
        best_model, report, report_display = models[i]()
        f1_benign = report['B']['f1-score']
        f1_malignant = report['M']['f1-score']
        f1_avg = (f1_benign + f1_malignant) / 2
        
        results.append({
            'features': ', '.join(df_sel.columns),
            'model': i_to_name(i),
            'f1_score': f1_avg
        })

df_results = pd.DataFrame(results)

plt.figure(figsize=(10, 6))
sns.barplot(data=df_results, x='features', y='f1_score', hue='model')

plt.xticks(rotation=45, ha='right')
plt.xlabel('Feature Subset')
plt.ylabel('F1 Score')
plt.title('Model Performance by Feature Subset')

plt.tight_layout()
plt.show()
Training with the following features: Index(['texture_se', 'area_se', 'compactness_mean', 'diagnosis'], dtype='object')
Training with the following features: Index(['area_worst', 'compactness_se', 'symmetry_se', 'perimeter_mean',
       'diagnosis'],
      dtype='object')
Training with the following features: Index(['texture_worst', 'perimeter_mean', 'diagnosis'], dtype='object')
Training with the following features: Index(['concave points_mean', 'radius_se', 'compactness_worst',
       'texture_worst', 'diagnosis'],
      dtype='object')
Training with the following features: Index(['texture_mean', 'area_se', 'fractal_dimension_mean', 'diagnosis'], dtype='object')
No description has been provided for this image



Conclusion ΒΆ

At the end we make the following remarks:
  • The dataset is outstandingly good: even simple models such as KNN achieves a 95%+ result, while more complex models are close to 100%.
  • There are only a handful of relevant features that actually correspond to the classification: radius, area, perimeter and concavity. This also aligns well with the biological interpretation since the larger the nucelus is the more probable it is that the patient has breast cancer (since malignant cells tend to invade their surroundings).
  • Even after deliberately deteriorating the quality of the dataset the tested models still achieved 90%+ accuracy (with f1-score).